for 1-bit CNNs as
$$
\begin{aligned}
L_B = \frac{\lambda}{2}\sum_{l=1}^{L}\sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\Big\{ & \,\|\hat{k}_n^{l,i}-w^l\circ k_n^{l,i}\|_2^2 \\
& + \nu\,(k_{n+}^{l,i}-\mu_{i+}^l)^T(\Psi_{i+}^l)^{-1}(k_{n+}^{l,i}-\mu_{i+}^l) \\
& + \nu\,(k_{n-}^{l,i}+\mu_{i-}^l)^T(\Psi_{i-}^l)^{-1}(k_{n-}^{l,i}+\mu_{i-}^l)
+ \nu\log\big(\det(\Psi^l)\big)\Big\} \\
+ \frac{\theta}{2}\sum_{m=1}^{M}\Big\{ & \,\|f_m-c_m\|_2^2
+ \sum_{n=1}^{N_f}\big[\sigma_{m,n}^{-2}(f_{m,n}-c_{m,n})^2+\log(\sigma_{m,n}^2)\big]\Big\},
\end{aligned}
\tag{3.108}
$$
where $k_n^{l,i}$, $l \in \{1,\ldots,L\}$, $i \in \{1,\ldots,C_o^l\}$, $n \in \{1,\ldots,C_i^l\}$, is the vectorization of the $i$-th kernel matrix at the $l$-th convolutional layer, $w^l$ is a vector used to modulate $k_n^{l,i}$, and $\mu_i^l$ and $\Psi_i^l$ are the mean and covariance of the $i$-th kernel vector at the $l$-th layer, respectively. We term $L_B$ the Bayesian optimization loss. Furthermore, we assume that the parameters within the same kernel are independent, so $\Psi_i^l$ becomes a diagonal matrix whose entries all equal $(\sigma_i^l)^2$, the variance of the $i$-th kernel of the $l$-th layer. In this case, the inverse of $\Psi_i^l$ can be computed quickly, and all elements of $\mu_i^l$ are identical and equal to the scalar $\mu_i^l$. Note that in our implementation, all elements of $w^l$ are replaced by their average during the forward process. Accordingly, only a scalar instead of a matrix is involved in the inference, which significantly accelerates the computation.
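To make the diagonal-covariance simplification concrete, the following is a minimal PyTorch-style sketch of the kernel part of $L_B$ for a single layer, assuming the symmetric case $\mu_{i+}^l = -\mu_{i-}^l = \mu_i^l$. The function name `bayesian_kernel_loss`, the tensor shapes, and the treatment of $w^l$ as a single averaged scalar are illustrative assumptions rather than the reference implementation.

```python
import torch

def bayesian_kernel_loss(k, k_hat, w, mu, sigma, nu=1e-4):
    """Kernel term of L_B for one layer, under the simplification above:
    Psi_i^l = (sigma_i^l)^2 * I and w^l replaced by its scalar average.

    k     : (C_o, C_i, K) full-precision kernels, vectorized per (i, n)
    k_hat : (C_o, C_i, K) binarized / reconstructed kernels
    w     : scalar modulation factor (average of w^l)
    mu    : (C_o,) per-channel mean of the positive Gaussian mode
    sigma : (C_o,) per-channel standard deviation
    """
    # Quantization error ||k_hat - w o k||_2^2
    recon = ((k_hat - w * k) ** 2).sum()

    mu_b = mu.view(-1, 1, 1)            # broadcast over (C_i, K)
    var_b = (sigma ** 2).view(-1, 1, 1)

    # Symmetric two-mode mixture: positive entries are pulled toward +mu,
    # negative entries toward -mu, i.e. (k_+ - mu) and (k_- + mu).
    pos = (k >= 0).float()
    dev = pos * (k - mu_b) + (1.0 - pos) * (k + mu_b)

    # With a diagonal covariance, the Mahalanobis distance reduces to an
    # element-wise scaled square, and the log-determinant of a K x K
    # diagonal matrix with identical variance is K * log(sigma^2).
    mahalanobis = (dev ** 2 / var_b).sum()
    logdet = k.shape[-1] * torch.log(sigma ** 2).sum()

    return recon + nu * (mahalanobis + logdet)
```

The Bayesian feature loss in Eq. 3.108 can be implemented in the same spirit, penalizing the variance-weighted distance between each feature $f_m$ and its class center $c_m$.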
After training the 1-bit CNNs, the Bayesian pruning loss $L_P$ is then used to optimize the feature channels. It can be written as:
$$
L_P = \sum_{l=1}^{L}\sum_{j=1}^{J_l}\sum_{i=1}^{I_j}\Big\{\|K_{i,j}^l-\bar{K}_j^l\|_2^2
+ \nu\,(K_{i,j}^l-\bar{K}_j^l)^T(\Psi_j^l)^{-1}(K_{i,j}^l-\bar{K}_j^l)
+ \nu\log\big(\det(\Psi_j^l)\big)\Big\},
\tag{3.109}
$$
where $J_l$ is the number of Gaussian clusters (groups) at the $l$-th layer, and $K_{i,j}^l$, $i = 1, 2, \ldots, I_j$, are those $K_i^l$'s that belong to the $j$-th group, with $\bar{K}_j^l$ their group mean. In our implementation, we define $J_l = \mathrm{int}(C_o^l \times \epsilon)$, where $\epsilon$ is a predefined pruning rate; in this chapter, we use the same $\epsilon$ for all layers. Note that when the $j$-th Gaussian has only one sample $K_{i,j}^l$, $\bar{K}_j^l = K_{i,j}^l$ and $\Psi_j^l$ is an identity matrix.
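As a companion to Eq. 3.109, here is a hedged PyTorch-style sketch of the per-layer pruning loss. The group assignments are assumed to come from some clustering of the kernels into $J_l = \mathrm{int}(C_o^l \times \epsilon)$ groups (e.g., k-means), and the diagonal group covariance is estimated here directly from the group members; both the name `bayesian_pruning_loss` and these estimation choices are assumptions for illustration.

```python
import torch

def bayesian_pruning_loss(kernels, groups, nu=1e-4, eps=1e-8):
    """Pruning loss L_P for one layer, assuming each group's covariance
    Psi_j^l is diagonal and estimated from the group members.

    kernels : (C_o, D) flattened kernels K_i^l of the layer
    groups  : (C_o,) integer cluster assignment in [0, J_l) per kernel
    """
    loss = kernels.new_zeros(())
    for j in groups.unique():
        members = kernels[groups == j]              # (I_j, D) kernels in group j
        center = members.mean(dim=0, keepdim=True)  # group mean \bar{K}_j^l

        if members.shape[0] == 1:
            var = torch.ones_like(center)           # single sample: identity covariance
        else:
            var = members.var(dim=0, unbiased=False, keepdim=True) + eps

        diff = members - center
        loss = loss + (diff ** 2).sum()                              # ||K_ij - K_j||^2
        loss = loss + nu * (diff ** 2 / var).sum()                   # Mahalanobis term
        loss = loss + nu * members.shape[0] * torch.log(var).sum()   # sum_i log det(Psi_j)

    return loss
```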
In BONNs, the cross-entropy loss $L_S$, the Bayesian optimization loss $L_B$, and the Bayesian pruning loss $L_P$ are aggregated to build the total loss:
$$
L = L_S + L_B + \zeta L_P,
\tag{3.110}
$$
where $\zeta$ is 0 during binarization training and becomes 1 during pruning. The Bayesian kernel loss constrains the distribution of the convolution kernels to a symmetric Gaussian mixture with two modes, while simultaneously minimizing the quantization error through the $\|\hat{k}_n^{l,i}-w^l\circ k_n^{l,i}\|_2^2$ term. Meanwhile, the Bayesian feature loss modifies the distribution of the features to reduce intraclass variation for better classification. The Bayesian pruning loss pulls kernels toward their group means and thus compresses the 1-bit CNNs further.
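Since $\zeta$ only switches between the two training stages, the aggregation in Eq. 3.110 amounts to a simple gate; a small sketch, with hypothetical argument names, is shown below.

```python
def total_loss(loss_s, loss_b, loss_p, pruning_stage):
    """L = L_S + L_B + zeta * L_P (Eq. 3.110):
    zeta is 0 during binarization training and 1 during pruning."""
    zeta = 1.0 if pruning_stage else 0.0
    return loss_s + loss_b + zeta * loss_p
```

In practice, the model is first trained with `pruning_stage=False` to obtain the 1-bit CNN, and then fine-tuned with `pruning_stage=True` so that $L_P$ drives kernels in the same group toward their mean.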
3.7.5 Forward Propagation
In forward propagation, the binarized kernels and activations accelerate the convolution
computation. The reconstruction vector is essential for 1-bit CNNs as described in Eq. 3.97,